6.5.1 Preliminaries
In a specific convolution layer, $\mathbf{w} \in \mathbb{R}^{C_{out}\times C_{in}\times K\times K}$, $\mathbf{a}_{in} \in \mathbb{R}^{C_{in}\times W_{in}\times H_{in}}$, and $\mathbf{a}_{out} \in \mathbb{R}^{C_{out}\times W_{out}\times H_{out}}$ represent its weights and feature maps, where $C_{in}$ and $C_{out}$ denote the numbers of input and output channels, $(H, W)$ are the height and width of the feature maps, and $K$ denotes the kernel size. Then we have
$$\mathbf{a}_{out} = \mathbf{a}_{in} \otimes \mathbf{w}, \tag{6.78}$$
where $\otimes$ is the convolution operation. We omit the batch normalization (BN) and activation layers for simplicity. The 1-bit model aims to quantize $\mathbf{w}$ and $\mathbf{a}_{in}$ into $\mathbf{b}_{\mathbf{w}} \in \{-1,+1\}^{C_{out}\times C_{in}\times K\times K}$ and $\mathbf{b}_{\mathbf{a}_{in}} \in \{-1,+1\}^{C_{in}\times W_{in}\times H_{in}}$, so that efficient XNOR and bit-count operations can replace the full-precision operations. Following [48], the forward process of the 1-bit CNN is
$$\mathbf{a}_{out} = \alpha \circ \mathbf{b}_{\mathbf{a}_{in}} \odot \mathbf{b}_{\mathbf{w}}, \tag{6.79}$$
where $\odot$ denotes the XNOR and bit-count operations, and $\circ$ denotes channel-wise multiplication. $\alpha = [\alpha_1, \cdots, \alpha_{C_{out}}] \in \mathbb{R}^+$ is the vector of channel-wise scale factors. $\mathbf{b} = \operatorname{sign}(\cdot)$ denotes a variable binarized by the sign function, which returns $1$ if the input is greater than zero and $-1$ otherwise. The output then passes through several nonlinear layers, e.g., a BN layer, a nonlinear activation layer, and a max-pooling layer, which we omit for simplicity. Finally, the output $\mathbf{a}_{out}$ is binarized to $\mathbf{b}_{\mathbf{a}_{out}}$ via the sign function. The fundamental objective of BNNs is to compute $\mathbf{w}$ such that it is as close as possible to its binarized counterpart, thereby minimizing the binarization effect. We thus define the reconstruction error as
$$\mathcal{L}_R(\mathbf{w}, \alpha) = \mathbf{w} - \alpha \circ \mathbf{b}_{\mathbf{w}}. \tag{6.80}$$
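To make Eqs. (6.78)–(6.80) concrete, the following is a minimal PyTorch sketch. The XNOR and bit-count kernel is emulated with an ordinary convolution over $\{-1,+1\}$ tensors, and $\alpha$ is set to the per-channel mean of $|\mathbf{w}|$, the closed-form minimizer of the reconstruction error used in XNOR-Net-style binarization [48]; all shapes and helper names here are illustrative.

```python
import torch
import torch.nn.functional as F

def binarize(x: torch.Tensor) -> torch.Tensor:
    # sign(.) as defined in the text: +1 if the input is > 0, -1 otherwise
    return torch.where(x > 0, torch.ones_like(x), -torch.ones_like(x))

def channel_scale(w: torch.Tensor) -> torch.Tensor:
    # Channel-wise scale factors alpha_c; the per-channel mean of |w| is the
    # closed-form minimizer of Eq. (6.80) adopted in XNOR-Net [48].
    return w.abs().mean(dim=(1, 2, 3))          # shape: (C_out,)

# Toy layer: C_out = 8, C_in = 4, K = 3, on a 16x16 input (shapes illustrative).
w = torch.randn(8, 4, 3, 3)
a_in = torch.randn(1, 4, 16, 16)

b_w, b_a = binarize(w), binarize(a_in)
alpha = channel_scale(w)

# Eq. (6.79): XNOR + bit-count emulated by a convolution over {-1, +1}
# tensors, followed by channel-wise rescaling with alpha.
a_out = alpha.view(1, -1, 1, 1) * F.conv2d(b_a, b_w, padding=1)

# Eq. (6.80): reconstruction error between w and alpha o b_w
L_R = w - alpha.view(-1, 1, 1, 1) * b_w
print(a_out.shape, L_R.abs().mean())
```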
6.5.2 Select Proposals with Information Discrepancy
To eliminate the large magnitude difference between the real-valued teacher and the 1-bit student, we introduce a channel-wise transformation for the proposals¹ of the intermediate neck. We first apply a transformation $\phi(\cdot)$ to a proposal $\tilde{R}_n \in \mathbb{R}^{C\times W\times H}$ and have
$$R_{n;c}(x,y) = \phi(\tilde{R}_{n;c}(x,y)) = \frac{\exp\!\big(\tilde{R}_{n;c}(x,y)/T\big)}{\sum_{(x',y')\in(W,H)} \exp\!\big(\tilde{R}_{n;c}(x',y')/T\big)}, \tag{6.81}$$
where $(x, y) \in (W, H)$ denotes a specific spatial location in the spatial range $(W, H)$, $c \in \{1, \cdots, C\}$ is the channel index, and $n \in \{1, \cdots, N\}$ is the proposal index, with $N$ the number of proposals. $T$ denotes a hyper-parameter controlling the statistical attributes of the channel-wise alignment operation². After the transformation, the features in each channel of a proposal are projected into the same feature space [231] and follow a Gaussian distribution:
$$p(R_{n;c}) \sim \mathcal{N}(\mu_{n;c}, \sigma^2_{n;c}). \tag{6.82}$$
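As a concrete illustration, here is a small PyTorch sketch of the channel-wise alignment in Eq. (6.81) and the per-channel Gaussian statistics of Eq. (6.82), with $T = 4$ as in the footnote; the patch size, channel count, and function names are assumptions for the example.

```python
import torch

def channelwise_softmax(R_tilde: torch.Tensor, T: float = 4.0) -> torch.Tensor:
    # Eq. (6.81): spatial softmax per channel with temperature T,
    # applied to a single proposal of shape (C, W, H).
    C = R_tilde.shape[0]
    flat = (R_tilde.reshape(C, -1) / T).softmax(dim=1)
    return flat.reshape_as(R_tilde)

def gaussian_stats(R: torch.Tensor):
    # Eq. (6.82): per-channel mean and variance of the aligned features.
    flat = R.reshape(R.shape[0], -1)
    return flat.mean(dim=1), flat.var(dim=1)

R_tilde = torch.randn(256, 7, 7)         # one proposal patch, C = 256
R = channelwise_softmax(R_tilde, T=4.0)  # each channel now sums to 1
mu, var = gaussian_stats(R)              # mu, var: shape (256,)
```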
We further evaluate the information discrepancy between the teacher and the student proposals. As shown in Fig. 6.16, the teacher and the student have $N_T$ and $N_S$ proposals, respectively. Every proposal in one model generates a counterpart feature-map patch at the same location in the other model. Thus, a total of $N_T + N_S$ proposal pairs are considered.
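One way to realize this pairing is to crop every proposal box from both the teacher and the student feature maps, sketched below with torchvision's roi_align; the feature sizes, box coordinates, and scale are assumptions for illustration, not the authors' implementation.

```python
import torch
from torchvision.ops import roi_align

# Hypothetical setup: 256-channel neck features on a 50x50 grid for a
# 400x400 input image (spatial_scale = 50 / 400); all numbers illustrative.
f_teacher = torch.randn(1, 256, 50, 50)
f_student = torch.randn(1, 256, 50, 50)

# Region proposals in image coordinates, each row = (batch_idx, x1, y1, x2, y2).
boxes_t = torch.tensor([[0., 50., 50., 200., 200.]])    # N_T = 1 teacher proposal
boxes_s = torch.tensor([[0., 100., 100., 300., 300.]])  # N_S = 1 student proposal

# Every proposal is cropped at the same location in BOTH models, giving
# N_T + N_S aligned pairs of feature-map patches.
boxes = torch.cat([boxes_t, boxes_s], dim=0)
patch_t = roi_align(f_teacher, boxes, output_size=(7, 7), spatial_scale=50 / 400)
patch_s = roi_align(f_student, boxes, output_size=(7, 7), spatial_scale=50 / 400)
# patch_t, patch_s: (N_T + N_S, 256, 7, 7) -- the candidate pairs whose
# information discrepancy is measured next.
```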
To evaluate the information discrepancy, we introduce the Mahalanobis distance of each
¹In this section, a proposal denotes the neck/backbone feature-map patch cropped by the region proposal of the detector.
²In this section, we set $T = 4$.